July 26, 2016
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
Our dataset consists of thirteen variables and 4898 observations. Wine quality has a median of 6, with a minimum of 3 and a maximum of 9. Some wines contain no citric acid (the minimum is 0), a component that can add ‘freshness’ and flavor to wines. Quality is the output attribute; the 11 input variables (based on physicochemical tests) could all be relevant, so we will explore them in depth.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
Wine quality is scored from 0 to 10, where 0 is the worst and 10 is the best. The quality histogram looks approximately normal. The best observed score is 9, most wines score 5 or 6, and more than 70% of the wines fall in the medium quality class.
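The share of wines at each score can be checked directly from the counts printed above. The following is a minimal sketch in base R; the variable names (`quality_counts`, `share_5_6`) are illustrative and not taken from the original analysis.

```r
# Counts per quality score, copied from the frequency table above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

total <- sum(quality_counts)            # total number of wines
share <- quality_counts / total         # proportion of wines per score
share_5_6 <- sum(share[c("5", "6")])    # combined share of scores 5 and 6

print(round(100 * share, 1))            # percentage per score
print(round(100 * share_5_6, 1))        # percentage scoring 5 or 6
```

Scores 5 and 6 together cover roughly three quarters of the wines, which is where the "more than 70%" figure comes from if the medium class spans those two scores.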
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
The three plots above, for fixed.acidity, volatile.acidity and citric.acid, all look approximately normal with some outliers. In particular, the maximum fixed.acidity reaches 14.2.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.130 6.890 7.405 7.467 7.960 14.960
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1527 1527 14.2 0.27 0.49 1.1
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1527 0.037 33 156 0.992 3.15
## sulphates alcohol quality quality.class total.acidity
## 1527 0.54 11.1 6 medium 14.96
I added a new variable, total.acidity, which sums the three acid variables (fixed.acidity, volatile.acidity and citric.acid). Its distribution also looks approximately normal. Only one wine in the dataset has a total.acidity larger than 14, and its quality is 6. Because the wine-making conditions (time, temperature, etc.) are unknown, I don’t know what caused that.
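As a sketch of how total.acidity can be derived, the snippet below rebuilds it for the two wines printed in this report (rows 1527 and 2782). The small data frame `wines_subset` is a stand-in for the full `wines` data, which is not reproduced here.

```r
# Two rows copied from the printed output (X = 1527 and X = 2782).
wines_subset <- data.frame(
  X                = c(1527, 2782),
  fixed.acidity    = c(14.2, 7.8),
  volatile.acidity = c(0.27, 0.965),
  citric.acid      = c(0.49, 0.6)
)

# total.acidity is the sum of the three acid columns.
wines_subset$total.acidity <- with(
  wines_subset,
  fixed.acidity + volatile.acidity + citric.acid
)

print(wines_subset)
```

The results, 14.96 and 9.365, match the total.acidity values shown in the printed rows.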
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 2782 2782 7.8 0.965 0.6 65.8
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 2782 0.074 8 160 1.03898 3.39
## sulphates alcohol quality quality.class total.acidity
## 2782 0.69 11.7 6 medium 9.365
The distribution of residual.sugar has a long right tail. After a log10 transformation, the distribution appears bimodal, with peaks around 1.5 and 7.5. Residual sugar is the amount of sugar remaining after fermentation stops; wines normally have more than 1 gram/liter of sugar, and wines with more than 45 grams/liter are considered sweet. Here, the minimum is 0.6 and the maximum is 65.8. The wine with a residual sugar value of 65.8 has a quality of 6. As with the high total.acidity value, I don’t know what caused that either.
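To illustrate why log10 helps with the long right tail, this sketch applies the transform to the five summary values of residual.sugar reported above. It compares how far the maximum sits from the median relative to the minimum, before and after the transform. The names `raw_tail` and `log_tail` are illustrative only.

```r
# Five-number summary of residual.sugar, copied from the output above.
sugar_summary <- c(Min = 0.6, Q1 = 1.7, Median = 5.2, Q3 = 9.9, Max = 65.8)

log_sugar <- log10(sugar_summary)
print(round(log_sugar, 2))

# Ratio of the upper range (Max - Median) to the lower range (Median - Min).
# A value near 1 indicates a roughly symmetric spread.
tail_ratio <- function(s) {
  unname((s["Max"] - s["Median"]) / (s["Median"] - s["Min"]))
}

raw_tail <- tail_ratio(sugar_summary)  # on the original scale
log_tail <- tail_ratio(log_sugar)      # on the log10 scale

print(round(c(raw = raw_tail, log10 = log_tail), 2))
```

On the raw scale the upper range is more than ten times the lower range; on the log10 scale the two are of similar size.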
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Chlorides: the amount of salt in the wine. The distribution looks approximately normal; the median is 0.043 and the mean is 0.04577, very close to the median.
## [1] "Summary of total.sulfur.dioxide"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
The histograms for free SO2, total SO2 and the ratio of free SO2 all look approximately normal. Since sulphates can contribute to total sulfur dioxide levels, the sulphates histogram is similar to the one for total sulfur dioxide.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Density has a very small range, from 0.9871 to 1.0390, very close to the density of water. The distribution looks approximately normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pH: most wines have pH values between 3.0 and 3.4 on the pH scale (from 0, very acidic, to 14, very basic). The distribution looks approximately normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Alcohol percentage probably affects the density, the pH level and the flavor of the wine. Judging from the distributions at different quality levels, wines with a higher alcohol level seem to be of better quality.
There are 4898 white wines in the dataset with 13 variables (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality, and index X).
Quality is the output attribute, scored from 0 to 10, where 0 is the worst and 10 is the best. It is originally an integer variable (observed values: 3, 4, 5, 6, 7, 8, 9). The 11 input variables (excluding X) are all numerical.
Other observations: the best wines score 9, and there are only 5 of them, so they are very rare. The most common score is the median level, 6.
The main feature in the dataset is quality, which may be correlated with some of the physicochemical attributes. I’d like to find out which attributes influence the quality of white wine. I suspect alcohol and some combination of the other attributes can be used to build a predictive model of wine quality.
Acidity, residual.sugar, total.sulfur.dioxide and pH likely contribute to the quality of wines.
Yes, I created a new variable, quality.class, and will use it to analyse the correlations between variables in the next two sections.
I transformed the positively skewed residual.sugar distribution with log10. The transformed distribution appears bimodal, with peaks around 1.5 and 7.5.
I also converted quality to a factor and added a new factor, quality.class (low, medium and high), so that in the Bivariate and Multivariate sections I can explore the attributes across different quality groups.
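A minimal sketch of how such a grouping can be built with `cut()` follows. The break points are an assumption: the report does not state them, and the only fixed point is that the printed rows show quality 6 in the medium class. Here 3 to 4 is treated as low, 5 to 6 as medium and 7 to 9 as high.

```r
# The quality scores that occur in the data.
quality <- c(3, 4, 5, 6, 7, 8, 9)

# Ordered factor version of the raw score.
quality_factor <- factor(quality, levels = 3:9, ordered = TRUE)

# ASSUMED break points: (2,4] = low, (4,6] = medium, (6,9] = high.
quality.class <- cut(
  quality,
  breaks = c(2, 4, 6, 9),
  labels = c("low", "medium", "high")
)

print(data.frame(quality, quality.class))
```

With different break points the class sizes change, so any statement about the share of wines per class depends on this choice.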
The plot matrix shows the correlation coefficient for each pair of variables. The strongest correlations with quality are with alcohol, density and chlorides (Pearson r: 0.44, -0.31, -0.21). The strongest correlations with alcohol are with density, total.sulfur.dioxide, residual.sugar and chlorides (Pearson r from -0.78 to -0.36).
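For reference, pairwise Pearson coefficients like these can be computed with `cor()`. The sketch below uses only the first ten rows printed by `str()` at the top of this report, so its values are illustrative and will not match the full-data coefficients quoted above. The data frame name `first10` is hypothetical.

```r
# First ten observations of three variables, copied from the str() output.
first10 <- data.frame(
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  chlorides      = c(0.045, 0.049, 0.05, 0.058, 0.058,
                     0.05, 0.045, 0.045, 0.049, 0.044)
)

# Pearson correlation matrix (the default method of cor()).
cor_matrix <- cor(first10, method = "pearson")
print(round(cor_matrix, 2))

# Correlations of every variable with alcohol, strongest first.
with_alcohol <- cor_matrix[, "alcohol"]
with_alcohol <- with_alcohol[names(with_alcohol) != "alcohol"]
print(with_alcohol[order(-abs(with_alcohol))])
```

Even on this small sample, residual.sugar is negatively correlated with alcohol, in line with the direction reported for the full dataset.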
## wines$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.34 11.00 12.60
## --------------------------------------------------------
## wines$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
## --------------------------------------------------------
## wines$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
## --------------------------------------------------------
## wines$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## wines$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
## --------------------------------------------------------
## wines$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
## --------------------------------------------------------
## wines$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
In this case, the plots show that wines in the medium and high quality.class tend to have higher alcohol values. The boxplot shows that wines with quality 6 to 9 have higher alcohol values.
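The per-quality summaries above can be condensed into a small table. This sketch copies the mean alcohol for each score from the output, finds the score with the lowest mean, and checks that the count-weighted average agrees with the overall mean of 10.51 reported earlier. The counts come from the quality frequency table; the object names are illustrative.

```r
# Mean alcohol per quality score (from the summaries above) and
# number of wines per score (from the quality frequency table).
alcohol_by_quality <- data.frame(
  quality      = 3:9,
  n            = c(20, 163, 1457, 2198, 880, 175, 5),
  mean_alcohol = c(10.34, 10.15, 9.809, 10.58, 11.37, 11.64, 12.18)
)

# Score with the lowest mean alcohol.
lowest <- alcohol_by_quality$quality[which.min(alcohol_by_quality$mean_alcohol)]

# Count-weighted mean across scores; should be close to the overall mean.
overall_mean <- with(alcohol_by_quality, sum(n * mean_alcohol) / sum(n))

print(alcohol_by_quality)
print(lowest)
print(round(overall_mean, 2))
```

Mean alcohol dips from quality 3 down to quality 5 and then rises steadily up to quality 9.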
## wines$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9911 0.9925 0.9944 0.9949 0.9969 1.0000
## --------------------------------------------------------
## wines$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9958 1.0000
## --------------------------------------------------------
## wines$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9933 0.9953 0.9953 0.9972 1.0020
## --------------------------------------------------------
## wines$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9876 0.9917 0.9937 0.9940 0.9959 1.0390
## --------------------------------------------------------
## wines$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9906 0.9918 0.9925 0.9937 1.0000
## --------------------------------------------------------
## wines$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0010
## --------------------------------------------------------
## wines$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9896 0.9898 0.9903 0.9915 0.9906 0.9970
In this case, the scatterplots of density vs quality and density vs quality.class show that wines with quality 6 to 9 (medium to high) tend to have lower density. The boxplot displays the same trend.
## wines$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400
## --------------------------------------------------------
## wines$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0130 0.0380 0.0460 0.0501 0.0540 0.2900
## --------------------------------------------------------
## wines$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600
## --------------------------------------------------------
## wines$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500
## --------------------------------------------------------
## wines$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500
## --------------------------------------------------------
## wines$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100
## --------------------------------------------------------
## wines$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0180 0.0210 0.0310 0.0274 0.0320 0.0350
In this case, the quality vs chlorides scatterplot shows that wines with quality 6 to 9 tend to have lower chlorides. The quality.class vs chlorides scatterplot and the boxplot display the same trend.
Alcohol and density have a negative linear relationship when we ignore the outliers.
In the scatterplot, total.sulfur.dioxide values are spread across all levels of alcohol. Although alcohol tends to decrease as total.sulfur.dioxide increases, the relationship is not linear.
As a general trend, alcohol values tend to decrease as residual.sugar values increase.
As chlorides increase within the range 0 to 0.1, alcohol values tend to decrease.
In the plot matrix, the strongest correlations with quality are with alcohol, density and chlorides (Pearson r: 0.44, -0.31, -0.21).
For wines with quality in the range 6 to 9 (quality.class medium and high), quality tends to increase as alcohol increases. In contrast, for wines with quality in the range 3 to 5 (quality.class low), quality tends to decrease as alcohol increases.
Similar patterns appear for quality vs density and quality vs chlorides.
Yes, alcohol is correlated with density, residual.sugar and chlorides. All three variables have a negative relationship with alcohol.
My main purpose is to find which chemical properties influence the quality of wines. After comparing the relationships between quality and the relevant variables, I found that alcohol has the strongest positive relationship with wine quality.
The strongest relationship in the dataset is between residual.sugar and density, with a correlation coefficient of 0.84.
Here, the plots clearly show that higher quality wines sit on the right side, which further indicates that higher quality wines tend to have high alcohol and low density.
As with alcohol vs density, the plots show that higher quality wines tend to have high alcohol, low residual.sugar and low chlorides.
##
## Calls:
## m1: lm(formula = alcohol ~ density, data = wines)
## m2: lm(formula = alcohol ~ density + residual.sugar, data = wines)
## m3: lm(formula = alcohol ~ density + residual.sugar + chlorides,
## data = wines)
##
## =========================================================
## m1 m2 m3
## ---------------------------------------------------------
## (Intercept) 329.588*** 564.755*** 544.341***
## (3.657) (5.365) (5.626)
## density -320.991*** -558.645*** -537.841***
## (3.679) (5.414) (5.684)
## residual.sugar 0.167*** 0.159***
## (0.003) (0.003)
## chlorides -4.614***
## (0.425)
## ---------------------------------------------------------
## R-squared 0.6 0.7 0.8
## adj. R-squared 0.6 0.7 0.8
## sigma 0.8 0.6 0.6
## F 7613.4 7302.6 5023.8
## p 0.0 0.0 0.0
## Log-likelihood -5668.6 -4580.9 -4522.6
## Deviance 2902.6 1861.6 1817.9
## AIC 11343.1 9169.7 9055.2
## BIC 11362.6 9195.7 9087.7
## N 4898 4898 4898
## =========================================================
Furthermore, the multivariate analysis revealed that higher quality wines tend to have high alcohol, low residual.sugar and low chlorides values. Since the plots show a linear relationship between alcohol and its relevant variables (density, residual.sugar and chlorides), I can build a linear model and use it to predict alcohol values.
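As a rough illustration of prediction with model m3, the sketch below plugs the coefficients from the table above into the linear formula and applies it to the first two wines shown in the `str()` output. Two caveats apply: the coefficients are rounded to three decimals, and `str()` prints density rounded, which matters because the density coefficient is very large. The helper `predict_alcohol` is hypothetical and not part of the original analysis.

```r
# Coefficients of m3, copied from the model table (rounded to 3 decimals).
coef_m3 <- c(intercept      = 544.341,
             density        = -537.841,
             residual.sugar = 0.159,
             chlorides      = -4.614)

# Hypothetical helper: evaluates the fitted linear formula by hand.
predict_alcohol <- function(density, residual.sugar, chlorides) {
  unname(coef_m3["intercept"]) +
    unname(coef_m3["density"]) * density +
    unname(coef_m3["residual.sugar"]) * residual.sugar +
    unname(coef_m3["chlorides"]) * chlorides
}

# First two wines from the str() output (density as printed, i.e. rounded).
first_two <- data.frame(
  density        = c(1.001, 0.994),
  residual.sugar = c(20.7, 1.6),
  chlorides      = c(0.045, 0.049),
  alcohol        = c(8.8, 9.5)       # observed values, for comparison
)

first_two$predicted <- with(
  first_two,
  predict_alcohol(density, residual.sugar, chlorides)
)

print(round(first_two, 3))
```

Both predictions land within about 0.3 of the observed alcohol values, which is in line with the residual standard error (sigma) of 0.6 reported for m3.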
In the low quality group of wines, alcohol tends to decrease and chlorides tend to increase as quality increases. The medium to high quality group shows the opposite trend.
Yes, I created a very simple linear model starting from alcohol and density.
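The model table reports R-squared rounded to one decimal (0.6, 0.7, 0.8), which suggests that the variables in the final model account for about 80% of the variance in alcohol and that each added variable improves R-squared by 10%. The rounding hides the detail, however. Recovering R-squared from the reported F statistics gives about 0.61, 0.75 and 0.75, so the final model explains roughly 75% of the variance: residual.sugar improves R-squared by about 14 percentage points, while chlorides adds less than one.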
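The R-squared values in the model table are rounded to one decimal, but they can be recovered more precisely from the reported F statistics using the standard identity R² = kF / (kF + n − k − 1), where k is the number of predictors and n the number of observations. This is a sketch using only numbers printed in the table; the object names are illustrative.

```r
n <- 4898  # number of observations (row N in the table)

# Number of predictors and F statistic for each model, from the table.
models <- data.frame(
  model  = c("m1", "m2", "m3"),
  k      = c(1, 2, 3),
  f_stat = c(7613.4, 7302.6, 5023.8)
)

# R-squared implied by F: R2 = k*F / (k*F + (n - k - 1)).
models$r.squared <- with(models, (k * f_stat) / (k * f_stat + (n - k - 1)))

# Gain in R-squared from adding each new variable.
models$gain <- c(NA, diff(models$r.squared))

print(data.frame(model     = models$model,
                 r.squared = round(models$r.squared, 3),
                 gain      = round(models$gain, 3)))
```

The value for m1 agrees with the squared alcohol-density correlation (-0.78² ≈ 0.61), which is a useful cross-check on the recovered figures.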
Alcohol is a very important variable among the wine properties, and it has the strongest relationship with wine quality. Since I did not find a linear relationship between quality and the relevant variables, I chose alcohol as the output for the linear model. However, wine making is a very complicated process and our dataset contains only a few physicochemical properties, so it is difficult to make this prediction more accurate.
Wine quality can be scored from 0 to 10, and around 75% of the wines score 5 or 6. No wine in this dataset has a quality lower than 3 or higher than 9.
As quality increases, mean alcohol tends to increase over the quality range 5 to 9. Over the quality range 3 to 5, however, mean alcohol tends to decrease.
As alcohol increases, density tends to decrease: there is a negative linear relationship between alcohol and density. The plot also shows that higher quality wines sit on the right side, which further illustrates that higher quality wines tend to have high alcohol and low density.
This dataset consists of thirteen variables with 4898 observations. My main purpose is to find which chemical properties influence the quality of white wines and, at the same time, to find the relationships between the other features.
First, I explored the variables by visualizing the distribution of each one and looked for unusual behavior in the histograms. I transformed the residual.sugar distribution with log10.
Next, I used a plot matrix to calculate and plot the correlations between the variables. None of the correlations with quality is above 0.5; the strongest correlation with quality is with alcohol. Alcohol has relatively strong correlations with density, residual.sugar and chlorides. Through bivariate visualization, I found that the relationship between quality and alcohol runs in two directions: it is negative for quality 3 to 5 and positive for quality 5 to 9. Alcohol has linear relationships with density, residual.sugar and chlorides.
Finally, I explored wine quality together with alcohol, density and chlorides. Higher quality wines tend to have high alcohol, low residual.sugar and low chlorides values, so alcohol, density and chlorides influence the quality of white wines most. Since the plots show a linear relationship between alcohol and its relevant variables (density, residual.sugar and chlorides), I can build a linear model and use it to predict alcohol values.
After some research, I found that wine making is a very complicated process. The quality of wine is affected by many factors, such as grape variety, geographical location and climate, fermentation temperature and time, the physicochemical properties in our dataset, and more. If we had all of that information, I believe we could build a very good model to predict wine quality, and even use it to optimize the wine-making process.